About the Brown Corpus

Brown Corpus (from the English-language Wikipedia)

The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled in the 1960s by Henry Kučera and W. Nelson Francis at Brown University, Providence, Rhode Island, as a general corpus (text collection) in the field of corpus linguistics. It contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961.
==History==
In 1967, Kučera and Francis published their classic work ''Computational Analysis of Present-Day American English'', which provided basic statistics on what is known today simply as the ''Brown Corpus''. The Brown Corpus was a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources. Kučera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, psychology, statistics, and sociology. It has been very widely used in computational linguistics, and was for many years among the most-cited resources in the field.
Shortly after publication of the first lexicostatistical analysis, Boston publisher Houghton Mifflin approached Kučera to supply a million-word, three-line citation base for its new ''American Heritage Dictionary''. This ground-breaking dictionary, which first appeared in 1969, was the first to be compiled using corpus linguistics for word frequency and other information.
The initial Brown Corpus had only the words themselves, plus a location identifier for each. Over the following several years, part-of-speech tags were applied. The Greene and Rubin tagging program (see part-of-speech tagging) helped considerably in this, but its high error rate meant that extensive manual proofreading was required.
The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena, and formed the basis for many later corpora such as the Lancaster-Oslo-Bergen Corpus. The tagged corpus enabled far more sophisticated statistical analysis, much of it carried out by graduate student Andrew Mackie. Some of the analysis appears in ''Frequency Analysis of English Usage: Lexicon and Grammar'', by Winthrop Nelson Francis and Henry Kučera, Houghton Mifflin (January 1983), ISBN 0-395-32250-2.
One interesting result is that even for quite large samples, graphing words in order of decreasing frequency of occurrence yields a hyperbola: the frequency of the ''n''-th most frequent word is roughly proportional to 1/''n''. Thus "the" constitutes nearly 7% of the Brown Corpus, and "to" and "of" more than another 3% each, while about half of the total vocabulary of roughly 50,000 words consists of ''hapax legomena'': words that occur only once in the corpus. (Kirsten Malmkjær, ''The Linguistics Encyclopedia'', 2nd ed., Routledge, 2002, ISBN 0-415-22210-9, p. 87.) This simple rank-vs.-frequency relationship was noted for an extraordinary variety of phenomena by George Kingsley Zipf (for example, see his ''The Psychobiology of Language''), and is known as Zipf's law.
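The rank-vs.-frequency relationship and the share of hapax legomena can be checked on any tokenized text. The following is a minimal sketch, not the method Kučera and Francis used: the function names (rank_frequency, hapax_fraction, zipf_prediction) are hypothetical, tokenization is assumed to have been done already, and the Zipf estimate simply divides the count of the most frequent word by the rank.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, share) tuples, most frequent word first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [
        (rank, word, count, count / total)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def hapax_fraction(tokens):
    """Fraction of the vocabulary (distinct words) that occurs exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0  # empty input: no vocabulary, so no hapax legomena
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


def zipf_prediction(top_count, rank):
    """Idealized Zipf estimate (an assumption): count at rank n is top_count / n."""
    return top_count / rank


if __name__ == "__main__":
    # Toy input only; a real check would use the full corpus text.
    sample = "the cat and the dog and the bird saw the fox".split()
    table = rank_frequency(sample)
    top_count = table[0][2]
    for rank, word, count, share in table[:3]:
        estimate = zipf_prediction(top_count, rank)
        print(f"{rank:>2} {word:<5} count={count} share={share:.2%} zipf={estimate:.2f}")
    print(f"hapax fraction of vocabulary: {hapax_fraction(sample):.2f}")
```

On a corpus the size of the Brown Corpus, the observed counts would be expected to track the 1/''n'' estimate only roughly, since Zipf's law is an empirical approximation and not an exact rule.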
Although the Brown Corpus pioneered the field of corpus linguistics, typical corpora today (such as the Corpus of Contemporary American English, the British National Corpus or the International Corpus of English) tend to be much larger, on the order of 100 million words.

Excerpt source: Wikipedia, the free encyclopedia